Assignment 3-2: Loan Approval Prediction¶

Kelsey Liu¶

2023-02-07

02. Loan Project¶

Step 1: Clean and prepare your data: There are several entries where values have been deleted to simulate dirty data. Please clean the data with whatever method(s) you believe is best/most suitable. Note that some of the missing values are truly blank (unknown answers) and thus may be impossible to clean; use your discretion.

Step 2: Build your models: Please build machine learning/statistical models in Python to predict the interest rate assigned to a loan. When writing the code associated with each model, please have the first part produce and save the model, followed by a second part that loads and applies the model.

Step 3: Test your models using the data found within the "Holdout for Testing" file. Save the results of the final model (remember you will only predict the first column in holdout test set with your best model results) in a single, separate CSV titled "Results from" *insert your name or UChicago net ID.

Step 4: Submit your work: Please submit all of your code for cleaning, prepping, and modeling your data, your "Results" file, a brief write-up comparing the pros and cons of the modeling techniques you used (no more than a paragraph). Your work will be scored on techniques used (appropriateness and complexity), model performance - measured by RMSE - on the data hold out, an understanding of the techniques you compared in your write-up, and your overall code.

Data Dictionary

  1. X1: Interest Rate on the loan (y, target variable)
  2. X2: A unique id for the loan
  3. X3: A unique id assigned for the borrower
  4. X4: Loan amount requested
  5. X5: Loan amount funded
  6. X6: Investor-funded portion of loan
  7. X7: Number of payments (36 or 60)
  8. X8: Loan grade
  9. X9: Loan subgrade
  10. X10: Employer or job title (self-filled)
  11. X11: Number of years employed (0 to 10; 10 = 10 or more)
  12. X12: Home ownership status: RENT, OWN, MORTGAGE, OTHER
  13. X13: Annual income of borrower
  14. X14: Income verified, not verified, or income source was verified
  15. X15: Date loan was issued (Format: Mon-YY)
  16. X16: Reason for loan provided by borrower
  17. X17: Loan category, as provided by borrower
  18. X18: Loan title, as provided by borrower
  19. X19: First 3 numbers of zip code
  20. X20: State of borrower
  21. X21: A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested loan, divided by the borrower's self-reported monthly income
  22. X22: The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
  23. X23: Date the borrower's earliest reported credit line was opened (Format: Mon-YY)
  24. X24: Number of inquiries by creditors during the past 6 months
  25. X25: Number of months since the borrower's last delinquency
  26. X26: Number of months since the last public record
  27. X27: Number of open credit lines in the borrower's credit file
  28. X28: Number of derogatory public records
  29. X29: Total credit revolving balance
  30. X30: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit
  31. X31: The total number of credit lines currently in the borrower's credit file
  32. X32: The initial listing status of the loan. Possible values are - W, F!

01. Import necessary library and Load data (Data for Cleaning & Modeling.csv)¶

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split

# Importing Classifier Modules
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
plt.style.use("ggplot")
%load_ext lab_black

# import warnings
# warnings.filterwarnings("ignore")
In [3]:
df = pd.read_csv("Data for Cleaning & Modeling.csv")
df.head(3)
/var/folders/08/336_pmn17t56r5jjv51n_8840000gn/T/ipykernel_64897/2781832233.py:1: DtypeWarning: Columns (15) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("Data for Cleaning & Modeling.csv")
Out[3]:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31 X32
0 11.89% 54734.0 80364.0 $25,000 $25,000 $19,080 36 months B B4 NaN < 1 year RENT 85000.0 VERIFIED - income Aug-09 Due to a lack of personal finance education an... debt_consolidation Debt consolidation for on-time payer 941xx CA 19.48 0.0 Feb-94 0.0 NaN NaN 10.0 0.0 28854.0 52.10% 42.0 f
1 10.71% 55742.0 114426.0 $7,000 $7,000 $673 36 months B B5 CNN < 1 year RENT 65000.0 not verified May-08 Just want to pay off the last bit of credit ca... credit_card Credit Card payoff 112xx NY 14.29 0.0 Oct-00 0.0 NaN NaN 7.0 0.0 33623.0 76.70% 7.0 f
2 16.99% 57167.0 137225.0 $25,000 $25,000 $24,725 36 months D D3 Web Programmer 1 year RENT 70000.0 VERIFIED - income Aug-14 Trying to pay a friend back for apartment brok... debt_consolidation mlue 100xx NY 10.50 0.0 Jun-00 0.0 41.0 NaN 10.0 0.0 19878.0 66.30% 17.0 f
In [4]:
# df.isna().sum()
In [5]:
# For Data Exploration
# df.X__.value_counts()
# df.X__.value_counts().count()
# df.X__.unique()

df.X12.value_counts()
Out[5]:
MORTGAGE    172112
RENT        136778
OWN          29588
OTHER          124
NONE            36
ANY              1
Name: X12, dtype: int64
In [6]:
columns_subset = [
    "X1",
    "X4",
    "X5",
    "X6",
    "X7",
    "X8",
    "X11",
    "X12",
    "X13",
    "X14",
    "X15",
    "X17",
    "X20",
    "X21",
    "X22",
    "X23",
    "X24",
    "X25",
    "X26",
    "X27",
    "X28",
    "X29",
    "X30",
    "X31",
    "X32",
]
# 'X2', 'X3', Id is not useful in prediction
# 'X9', Decide to use X8 instead of X9 (more information)
# 'X10' - Employer or job title (self-filled), relatively unclear and contains 23986 missing records, better dropped.
# 'X16' (Reason for loan) - Too messy and too many missing records, exclude for now.
# 'X18' - self filled and similar to 'X17', can be dropped and use 'X17' instead.
# 'X19' - Zip code first 3 digits provides just a little bit more info than state, can be dropped and use 'X20' instead/

df = pd.read_csv("Data for Cleaning & Modeling.csv", usecols=columns_subset)
In [7]:
column_rename = {
    "X1": "Interest_Rate",
    "X4": "Loan_Requested",
    "X5": "Loan_Funded",
    "X6": "Investor_portion",
    "X7": "Number_of_Payments",
    "X8": "Loan_Grade",
    "X11": "Years_Employed",
    "X12": "Home_Ownership",
    "X13": "Annual_Income",
    "X14": "Verification",
    "X15": "Issued_MonYY",
    "X17": "Loan_Category",
    "X20": "State",
    "X21": "Debit_Payment_Ratio",
    "X22": "Delinquency_Frequency",
    "X23": "Credit_Earliest_MonYY",
    "X24": "Inquiries_Count",
    "X25": "Delinquency_Recency",
    "X26": "Public_Record_Recency",
    "X27": "Open_Credit_Line_Count",
    "X28": "Derogatory_Public_Record_Count",
    "X29": "Total_Credit",
    "X30": "Credit_Utilization_Rate",
    "X31": "Total_Credit_Line_Count",
    "X32": "Initial_Status",
}

df = df.rename(columns=column_rename)

02. Data Cleaning and Preprocessing¶

In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 25 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Interest_Rate                   338990 non-null  object 
 1   Loan_Requested                  399999 non-null  object 
 2   Loan_Funded                     399999 non-null  object 
 3   Investor_portion                399999 non-null  object 
 4   Number_of_Payments              399999 non-null  object 
 5   Loan_Grade                      338730 non-null  object 
 6   Years_Employed                  382462 non-null  object 
 7   Home_Ownership                  338639 non-null  object 
 8   Annual_Income                   338972 non-null  float64
 9   Verification                    399999 non-null  object 
 10  Issued_MonYY                    399999 non-null  object 
 11  Loan_Category                   399999 non-null  object 
 12  State                           399999 non-null  object 
 13  Debit_Payment_Ratio             399999 non-null  float64
 14  Delinquency_Frequency           399999 non-null  float64
 15  Credit_Earliest_MonYY           399999 non-null  object 
 16  Inquiries_Count                 399999 non-null  float64
 17  Delinquency_Recency             181198 non-null  float64
 18  Public_Record_Recency           51155 non-null   float64
 19  Open_Credit_Line_Count          399999 non-null  float64
 20  Derogatory_Public_Record_Count  399999 non-null  float64
 21  Total_Credit                    399999 non-null  float64
 22  Credit_Utilization_Rate         399733 non-null  object 
 23  Total_Credit_Line_Count         399999 non-null  float64
 24  Initial_Status                  399999 non-null  object 
dtypes: float64(10), object(15)
memory usage: 76.3+ MB
In [9]:
df.head(3)
Out[9]:
Interest_Rate Loan_Requested Loan_Funded Investor_portion Number_of_Payments Loan_Grade Years_Employed Home_Ownership Annual_Income Verification Issued_MonYY Loan_Category State Debit_Payment_Ratio Delinquency_Frequency Credit_Earliest_MonYY Inquiries_Count Delinquency_Recency Public_Record_Recency Open_Credit_Line_Count Derogatory_Public_Record_Count Total_Credit Credit_Utilization_Rate Total_Credit_Line_Count Initial_Status
0 11.89% $25,000 $25,000 $19,080 36 months B < 1 year RENT 85000.0 VERIFIED - income Aug-09 debt_consolidation CA 19.48 0.0 Feb-94 0.0 NaN NaN 10.0 0.0 28854.0 52.10% 42.0 f
1 10.71% $7,000 $7,000 $673 36 months B < 1 year RENT 65000.0 not verified May-08 credit_card NY 14.29 0.0 Oct-00 0.0 NaN NaN 7.0 0.0 33623.0 76.70% 7.0 f
2 16.99% $25,000 $25,000 $24,725 36 months D 1 year RENT 70000.0 VERIFIED - income Aug-14 debt_consolidation NY 10.50 0.0 Jun-00 0.0 41.0 NaN 10.0 0.0 19878.0 66.30% 17.0 f

02-1.Some simple datatype change

In [10]:
# Data type - From "String" To "Float"

## Rate - remove % and change to float type
for col in ["Interest_Rate", "Credit_Utilization_Rate"]:
    df[col] = df[col].str.slice(stop=-1).astype("float")

## Dollar Amount - remove $ and change to integer type
for col in ["Loan_Requested", "Loan_Funded", "Investor_portion"]:
    df[col] = df[col].str.slice(start=1).str.replace(",", "").astype("float")

# Data type - To Category
## Number of Payments - only 2: 36 months/ 60 months
# Change to 'Payments_is_36' column with 2 value: (1, 0)
df["Payments_is_36"] = (df["Number_of_Payments"] == " 36months").map(
    {False: 0, True: 1}
)
df = df.drop("Number_of_Payments", axis=1)

## Simply change to Category
Category = [
    # "Loan_Grade", - Need further processing
    "Years_Employed",
    # "Home_Ownership", - Need further processing
    "Verification",
    "Loan_Category",
    "State",
    "Initial_Status",
]
for col in Category:
    df[col] = df[col].astype("category")

02-2.Explore the data type and missing values

In [11]:
# df.info()
df.isna().sum()
Out[11]:
Interest_Rate                      61010
Loan_Requested                         1
Loan_Funded                            1
Investor_portion                       1
Loan_Grade                         61270
Years_Employed                     17538
Home_Ownership                     61361
Annual_Income                      61028
Verification                           1
Issued_MonYY                           1
Loan_Category                          1
State                                  1
Debit_Payment_Ratio                    1
Delinquency_Frequency                  1
Credit_Earliest_MonYY                  1
Inquiries_Count                        1
Delinquency_Recency               218802
Public_Record_Recency             348845
Open_Credit_Line_Count                 1
Derogatory_Public_Record_Count         1
Total_Credit                           1
Credit_Utilization_Rate              267
Total_Credit_Line_Count                1
Initial_Status                         1
Payments_is_36                         0
dtype: int64
In [12]:
df[df["Loan_Requested"].isna()]
Out[12]:
Interest_Rate Loan_Requested Loan_Funded Investor_portion Loan_Grade Years_Employed Home_Ownership Annual_Income Verification Issued_MonYY Loan_Category State Debit_Payment_Ratio Delinquency_Frequency Credit_Earliest_MonYY Inquiries_Count Delinquency_Recency Public_Record_Recency Open_Credit_Line_Count Derogatory_Public_Record_Count Total_Credit Credit_Utilization_Rate Total_Credit_Line_Count Initial_Status Payments_is_36
364111 7.69 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0

This one entry above contains mostly NA records, should be dropped.

In [13]:
df = df[~df["Loan_Requested"].isna()]
In [14]:
df.head(3)
Out[14]:
Interest_Rate Loan_Requested Loan_Funded Investor_portion Loan_Grade Years_Employed Home_Ownership Annual_Income Verification Issued_MonYY Loan_Category State Debit_Payment_Ratio Delinquency_Frequency Credit_Earliest_MonYY Inquiries_Count Delinquency_Recency Public_Record_Recency Open_Credit_Line_Count Derogatory_Public_Record_Count Total_Credit Credit_Utilization_Rate Total_Credit_Line_Count Initial_Status Payments_is_36
0 11.89 25000.0 25000.0 19080.0 B < 1 year RENT 85000.0 VERIFIED - income Aug-09 debt_consolidation CA 19.48 0.0 Feb-94 0.0 NaN NaN 10.0 0.0 28854.0 52.1 42.0 f 0
1 10.71 7000.0 7000.0 673.0 B < 1 year RENT 65000.0 not verified May-08 credit_card NY 14.29 0.0 Oct-00 0.0 NaN NaN 7.0 0.0 33623.0 76.7 7.0 f 0
2 16.99 25000.0 25000.0 24725.0 D 1 year RENT 70000.0 VERIFIED - income Aug-14 debt_consolidation NY 10.50 0.0 Jun-00 0.0 41.0 NaN 10.0 0.0 19878.0 66.3 17.0 f 0

02-3. Dealing with Mon-YY Column

In [15]:
# 'Issued_MonYY', 'Credit_Earliest_MonYY'
# Change both data into timestamp and subtract from today's timestamp to get how long has passed since the record as numeric record
from datetime import datetime

now_timestamp = datetime.timestamp(datetime.now())

df["Issued_MonYY"] = df["Issued_MonYY"].apply(
    lambda x: datetime.strptime(str(x), "%b-%y")
)
df["Issued_MonYY"] = df["Issued_MonYY"].apply(lambda x: datetime.timestamp(x))

df["Since_Issued"] = now_timestamp - df["Issued_MonYY"]
In [16]:
df["Credit_Earliest_MonYY"] = df["Credit_Earliest_MonYY"].apply(
    lambda x: datetime.strptime(str(x), "%b-%y")
)
In [17]:
df["Credit_Earliest_MonYY"] = df["Credit_Earliest_MonYY"].apply(
    lambda x: datetime.timestamp(x)
)
In [18]:
df["Since_First_Crdit"] = now_timestamp - df["Credit_Earliest_MonYY"]
In [19]:
df = df.drop(["Issued_MonYY", "Credit_Earliest_MonYY"], axis=1)
In [20]:
# df.info()
# df.head()
# df.isna().sum()

02-4.Impute the missing value for both "Delinquency_Recency" and "Public_Record_Recency"

In [21]:
# np.sort(df.Delinquency_Recency.unique())
# min: 0 , max: 188
# We suppose the NA value means there's never been a delinquency, we replace the NA with extreme value 999.
df["Delinquency_Recency"] = df.Delinquency_Recency.fillna(999)
# df.Delinquency_Recency.isna().sum()
In [22]:
# np.sort(df.Public_Record_Recency.unique())
# min: 0 , max: 129
# We suppose the NA value means there's never been a public record, we replace the NA with extreme value 999.
df["Public_Record_Recency"] = df.Public_Record_Recency.fillna(999)
# df.Public_Record_Recency.isna().sum()
In [23]:
# Since Interest_Rate is our target variable, we need valid value to train our model
# We drop whichever records with NA in Interest_Rate
df = df[~df["Interest_Rate"].isna()]
In [24]:
df.isna().sum()
Out[24]:
Interest_Rate                         0
Loan_Requested                        0
Loan_Funded                           0
Investor_portion                      0
Loan_Grade                        51866
Years_Employed                    14794
Home_Ownership                    51959
Annual_Income                     51751
Verification                          0
Loan_Category                         0
State                                 0
Debit_Payment_Ratio                   0
Delinquency_Frequency                 0
Inquiries_Count                       0
Delinquency_Recency                   0
Public_Record_Recency                 0
Open_Credit_Line_Count                0
Derogatory_Public_Record_Count        0
Total_Credit                          0
Credit_Utilization_Rate             224
Total_Credit_Line_Count               0
Initial_Status                        0
Payments_is_36                        0
Since_Issued                          0
Since_First_Crdit                     0
dtype: int64

02-5. Change the "Years_Employed" from categorical to numerical and impute the missing values using interpolate()

In [25]:
# Years_Employed                    14794
sns.histplot(df[df["Years_Employed"].notnull()]["Years_Employed"])
plt.show()
In [26]:
# Change df["Years_Employed"] to float
df["Years_Employed"] = df["Years_Employed"].apply(
    lambda x: x.replace("< 1", "0").replace("+", "").strip()[:2]
)
df["Years_Employed"] = df["Years_Employed"].astype("float")
df["Years_Employed"].describe()
Out[26]:
count    324195.000000
mean          5.991567
std           3.613908
min           0.000000
25%           3.000000
50%           6.000000
75%          10.000000
max          10.000000
Name: Years_Employed, dtype: float64
In [27]:
# Fill NaN values using an interpolation method
df["Years_Employed"] = df["Years_Employed"].interpolate()
# Check again the distribution using .describe()
df["Years_Employed"].describe()
Out[27]:
count    338989.000000
mean          5.996765
std           3.574865
min           0.000000
25%           3.000000
50%           6.000000
75%          10.000000
max          10.000000
Name: Years_Employed, dtype: float64

02-6. For "Home_Ownership" column, assign "NONE", "ANY" and NaN into the "OTHER" Category.

In [28]:
# Home ownership status: RENT, OWN, MORTGAGE, OTHER
# df["Home_Ownership"].unique(): ['RENT', 'OWN', 'MORTGAGE', 'NONE', NaN, 'OTHER', 'ANY']
# Since there should be only 4 status, we combine "NONE" and "ANY" into "Other"

df["Home_Ownership"] = (
    df["Home_Ownership"].str.replace("NONE", "OTHER").str.replace("ANY", "OTHER")
)

## Home_Ownership (NaN)                   51959
### Assign NA into group 'other'
df["Home_Ownership"] = df["Home_Ownership"].fillna("OTHER")
df["Home_Ownership"] = df["Home_Ownership"].astype("category")
In [29]:
df["Home_Ownership"].value_counts()
Out[29]:
MORTGAGE    145958
RENT        115958
OTHER        52097
OWN          24976
Name: Home_Ownership, dtype: int64

02-7. Change 'State' to 5 'Region's: 'West', 'Northeast', 'South', 'Midwest', 'Other'

In [30]:
states = {
    "AK": "Other",
    "AL": "South",
    "AR": "South",
    "AS": "Other",
    "AZ": "West",
    "CA": "West",
    "CO": "West",
    "CT": "Northeast",
    "DC": "Northeast",
    "DE": "Northeast",
    "FL": "South",
    "GA": "South",
    "GU": "Other",
    "HI": "Other",
    "IA": "Midwest",
    "ID": "West",
    "IL": "Midwest",
    "IN": "Midwest",
    "KS": "Midwest",
    "KY": "South",
    "LA": "South",
    "MA": "Northeast",
    "MD": "Northeast",
    "ME": "Northeast",
    "MI": "West",
    "MN": "Midwest",
    "MO": "Midwest",
    "MP": "Other",
    "MS": "South",
    "MT": "West",
    "NA": "Other",
    "NC": "South",
    "ND": "Midwest",
    "NE": "West",
    "NH": "Northeast",
    "NJ": "Northeast",
    "NM": "West",
    "NV": "West",
    "NY": "Northeast",
    "OH": "Midwest",
    "OK": "South",
    "OR": "West",
    "PA": "Northeast",
    "PR": "Other",
    "RI": "Northeast",
    "SC": "South",
    "SD": "Midwest",
    "TN": "South",
    "TX": "South",
    "UT": "West",
    "VA": "South",
    "VI": "Other",
    "VT": "Northeast",
    "WA": "West",
    "WI": "Midwest",
    "WV": "South",
    "WY": "West",
}


df["Region"] = df["State"].map(states)

df = df.drop("State", axis=1)
In [31]:
df["Region"] = df["Region"].astype("category")

02-8. Remove outliers in "Annual_Income" and impute missing value using interpolate()

In [32]:
## Annual_Income (NaN)                    51751
df["Annual_Income"].describe()
Out[32]:
count    2.872380e+05
mean     7.315146e+04
std      5.618967e+04
min      3.000000e+03
25%      4.500000e+04
50%      6.300000e+04
75%      8.807875e+04
max      7.500000e+06
Name: Annual_Income, dtype: float64
In [33]:
fig = px.box(df, x="Annual_Income")
fig.show()
# hover the mouse on the box to view of the box plot values
# Upper fence: 152.606k
In [34]:
# Outlier: > 152,606

df = df[(df["Annual_Income"] <= 152_606) | (df["Annual_Income"].isna())]
In [35]:
df["Annual_Income"].describe()
Out[35]:
count    275048.000000
mean      66397.971323
std       29090.473306
min        3000.000000
25%       45000.000000
50%       60000.000000
75%       84000.000000
max      152606.000000
Name: Annual_Income, dtype: float64
In [36]:
# Fill NaN values using an interpolation method
df["Annual_Income"] = df["Annual_Income"].interpolate()
# Check again the distribution using .describe()
df["Annual_Income"].describe()
Out[36]:
count    326799.000000
mean      66398.567132
std       27956.946520
min        3000.000000
25%       45000.000000
50%       61500.000000
75%       83000.000000
max      152606.000000
Name: Annual_Income, dtype: float64

02-9. Remove outliers in "Credit_Utilization_Rate" and impute missing value using interpolate()

In [37]:
## Credit_Utilization_Rate             224
df["Credit_Utilization_Rate"].describe()
Out[37]:
count    326599.000000
mean         56.245653
std          23.687279
min           0.000000
25%          39.500000
50%          57.800000
75%          74.800000
max         892.300000
Name: Credit_Utilization_Rate, dtype: float64
In [38]:
fig = px.box(df, y="Credit_Utilization_Rate")
fig.show()
# hover the mouse on the box to view of the box plot values
# Upper fence: 127.4
In [39]:
# Outlier: > 127.4
# df[df["Credit_Utilization_Rate"] > 127.4] : 9 entries
df = df[
    (df["Credit_Utilization_Rate"] <= 127.4) | (df["Credit_Utilization_Rate"].isna())
]
In [40]:
sns.displot(df["Credit_Utilization_Rate"], kde=True)
plt.show()
In [41]:
# Fill NaN values using an interpolation method
df["Credit_Utilization_Rate"] = df["Credit_Utilization_Rate"].interpolate()
# Check again the distribution using .describe()
sns.displot(df["Credit_Utilization_Rate"], kde=True)
plt.show()

02-10. Assign the missing values into a new loan grade called "O", as in "Others"

In [42]:
df["Loan_Grade"].unique()
Out[42]:
array(['B', 'D', 'C', 'A', 'E', 'F', nan, 'G'], dtype=object)
In [43]:
## Loan_Grade (NaN)                       51866
### Assign NA into a new group called 'O', Others.
df["Loan_Grade"] = df["Loan_Grade"].fillna("O")
df["Loan_Grade"] = df["Loan_Grade"].astype("category")
In [44]:
df.isna().sum()
# No Missing Values!
Out[44]:
Interest_Rate                     0
Loan_Requested                    0
Loan_Funded                       0
Investor_portion                  0
Loan_Grade                        0
Years_Employed                    0
Home_Ownership                    0
Annual_Income                     0
Verification                      0
Loan_Category                     0
Debit_Payment_Ratio               0
Delinquency_Frequency             0
Inquiries_Count                   0
Delinquency_Recency               0
Public_Record_Recency             0
Open_Credit_Line_Count            0
Derogatory_Public_Record_Count    0
Total_Credit                      0
Credit_Utilization_Rate           0
Total_Credit_Line_Count           0
Initial_Status                    0
Payments_is_36                    0
Since_Issued                      0
Since_First_Crdit                 0
Region                            0
dtype: int64

02-12. Create dummy variables for categorical variables.

In [45]:
df = pd.get_dummies(df, drop_first=True)
df.head()
Out[45]:
Interest_Rate Loan_Requested Loan_Funded Investor_portion Years_Employed Annual_Income Debit_Payment_Ratio Delinquency_Frequency Inquiries_Count Delinquency_Recency Public_Record_Recency Open_Credit_Line_Count Derogatory_Public_Record_Count Total_Credit Credit_Utilization_Rate Total_Credit_Line_Count Payments_is_36 Since_Issued Since_First_Crdit Loan_Grade_B Loan_Grade_C Loan_Grade_D Loan_Grade_E Loan_Grade_F Loan_Grade_G Loan_Grade_O Home_Ownership_OTHER Home_Ownership_OWN Home_Ownership_RENT Verification_VERIFIED - income source Verification_not verified Loan_Category_credit_card Loan_Category_debt_consolidation Loan_Category_educational Loan_Category_home_improvement Loan_Category_house Loan_Category_major_purchase Loan_Category_medical Loan_Category_moving Loan_Category_other Loan_Category_renewable_energy Loan_Category_small_business Loan_Category_vacation Loan_Category_wedding Initial_Status_w Region_Northeast Region_Other Region_South Region_West
0 11.89 25000.0 25000.0 19080.0 0.0 85000.0 19.48 0.0 0.0 999.0 999.0 10.0 0.0 28854.0 52.1 42.0 0 4.271869e+08 9.162073e+08 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 10.71 7000.0 7000.0 673.0 0.0 65000.0 14.29 0.0 0.0 999.0 999.0 7.0 0.0 33623.0 76.7 7.0 0 4.666717e+08 7.059133e+08 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
2 16.99 25000.0 25000.0 24725.0 1.0 70000.0 10.50 0.0 0.0 41.0 999.0 10.0 0.0 19878.0 66.3 17.0 0 2.694205e+08 7.164541e+08 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
3 13.11 1200.0 1200.0 1200.0 10.0 54000.0 5.47 0.0 0.0 64.0 999.0 5.0 0.0 2584.0 40.4 31.0 0 4.088665e+08 1.202882e+09 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
4 13.57 10800.0 10800.0 10692.0 6.0 32000.0 11.63 0.0 1.0 58.0 999.0 14.0 0.0 3511.0 25.6 40.0 0 4.192381e+08 8.268697e+08 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
In [46]:
df.shape
Out[46]:
(326790, 49)

02-13. Split the data into X (features) and y (label), which is the interest rate.

In [47]:
X, y = df.drop("Interest_Rate", axis=1), df["Interest_Rate"]

03. Data Modeling¶

03-1. Splitting X,y into training and testing set.

In [48]:
(X_train, X_test, y_train, y_test) = train_test_split(
    X, y, test_size=0.2, random_state=1
)

03-2. Import necessary library

In [49]:
# Import DecisionTreeRegressor from sklearn.tree
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE
In [50]:
# from sklearn.model_selection import RandomizedSearchCV
In [51]:
rf = RandomForestRegressor(random_state=1)
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = ["auto", "sqrt"]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {
    "n_estimators": n_estimators,
    "max_features": max_features,
    "max_depth": max_depth,
    "min_samples_split": min_samples_split,
    "min_samples_leaf": min_samples_leaf,
    "bootstrap": bootstrap,
}
In [52]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
####rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation,
# search across 10 different combinations, and use all available cores
####rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 5, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
####rf_random.fit(X_train, y_train)
In [53]:
# rf_random.best_params_

03-3. Random Forest Regressor Model

In [54]:
# Instantiate a random forests regressor 'rf' 400 estimators
SEED = 1
rf = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.13, random_state=SEED)
# Fit 'rf' to the training set
rf.fit(X_train, y_train)
Out[54]:
RandomForestRegressor(min_samples_leaf=0.13, n_estimators=400, random_state=1)
In [56]:
# Compute Train / Test set RMSE

# Predict the labels of the training set
y_pred_train = rf.predict(X_train)

# Compute y_pred
y_pred_test = rf.predict(X_test)

# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = -cross_val_score(
    rf, X_train, y_train, cv=10, scoring="neg_mean_squared_error", n_jobs=-1
)

# Evaluate the training set RMSE of rf
RMSE_train = (MSE(y_train, y_pred_train)) ** (1 / 2)

# Evaluate the test set RMSE of rf
RMSE_test = (MSE(y_test, y_pred_test)) ** (1 / 2)

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean()) ** (1 / 2)

# Print RMSE_train
print("Train RMSE: {:.2f}".format(RMSE_train))

# Print RMSE_test
print("Test RMSE of rf: {:.2f}".format(RMSE_test))

# Print RMSE_CV
print("CV RMSE: {:.2f}".format(RMSE_CV))
Train RMSE: 3.85
Test RMSE of rf: 3.87
CV RMSE: 3.85
In [57]:
# Instantiate dt, Decision Tree
dt = DecisionTreeRegressor(max_depth=10, min_samples_leaf=0.13, random_state=3)

# Fit dt to the training set
dt.fit(X_train, y_train)
Out[57]:
DecisionTreeRegressor(max_depth=10, min_samples_leaf=0.13, random_state=3)
In [58]:
# Compute Train / Test set RMSE

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Compute y_pred
y_pred_test = dt.predict(X_test)

# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = -cross_val_score(
    dt, X_train, y_train, cv=10, scoring="neg_mean_squared_error", n_jobs=-1
)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train)) ** (1 / 2)

# Evaluate the test set RMSE of dt
RMSE_test = (MSE(y_test, y_pred_test)) ** (1 / 2)

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean()) ** (1 / 2)

# Print RMSE_train
print("Train RMSE: {:.2f}".format(RMSE_train))

# Print RMSE_test
print("Test RMSE of dt: {:.2f}".format(RMSE_test))

# Print RMSE_CV
print("CV RMSE: {:.2f}".format(RMSE_CV))
Train RMSE: 3.65
Test RMSE of dt: 3.67
CV RMSE: 3.65
In [104]:
# Instantiate dt, Decision Tree
dt2 = DecisionTreeRegressor(max_depth=10, min_samples_leaf=0.05, random_state=3)

# Fit dt to the training set
dt2.fit(X_train, y_train)


# Compute Train / Test set RMSE

# Predict the labels of the training set
y_pred_train = dt2.predict(X_train)

# Compute y_pred
y_pred_test = dt2.predict(X_test)

# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = -cross_val_score(
    dt2, X_train, y_train, cv=10, scoring="neg_mean_squared_error", n_jobs=-1
)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train)) ** (1 / 2)

# Evaluate the test set RMSE of dt
RMSE_test = (MSE(y_test, y_pred_test)) ** (1 / 2)

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean()) ** (1 / 2)

# Print RMSE_train
print("Train RMSE: {:.2f}".format(RMSE_train))

# Print RMSE_test
print("Test RMSE of dt: {:.2f}".format(RMSE_test))

# Print RMSE_CV
print("CV RMSE: {:.2f}".format(RMSE_CV))
Train RMSE: 2.97
Test RMSE of dt: 3.00
CV RMSE: 2.98
In [60]:
importances_dt = pd.Series(dt.feature_importances_, index=X.columns)
# Sort importances_rf
sorted_importances_dt = importances_dt.sort_values(ascending=False).head(5)
# Make a horizontal bar plot
sorted_importances_dt.plot(kind="barh", color="blue")
plt.show()

04. Fitting the test data (Holdout for Testing.csv)¶

04-1. Read in the data with necessary columns.

In [61]:
dftest = pd.read_csv("Holdout for Testing.csv", usecols=columns_subset)
dftest = dftest.drop("X1", axis=1)
In [62]:
column_rename = {
    # "X1": "Interest_Rate",
    "X4": "Loan_Requested",
    "X5": "Loan_Funded",
    "X6": "Investor_portion",
    "X7": "Number_of_Payments",
    "X8": "Loan_Grade",
    "X11": "Years_Employed",
    "X12": "Home_Ownership",
    "X13": "Annual_Income",
    "X14": "Verification",
    "X15": "Issued_MonYY",
    "X17": "Loan_Category",
    "X20": "State",
    "X21": "Debit_Payment_Ratio",
    "X22": "Delinquency_Frequency",
    "X23": "Credit_Earliest_MonYY",
    "X24": "Inquiries_Count",
    "X25": "Delinquency_Recency",
    "X26": "Public_Record_Recency",
    "X27": "Open_Credit_Line_Count",
    "X28": "Derogatory_Public_Record_Count",
    "X29": "Total_Credit",
    "X30": "Credit_Utilization_Rate",
    "X31": "Total_Credit_Line_Count",
    "X32": "Initial_Status",
}

dftest = dftest.rename(columns=column_rename)

04-2. Replicate the cleaning and preprcessing steps for train data on holdout test data

In [63]:
# Data type - From "String" To "Float"

## Rate - remove % and change to float type
dftest["Credit_Utilization_Rate"] = (
    dftest["Credit_Utilization_Rate"].str.slice(stop=-1).astype("float")
)

## Dollar Amount - remove $ and change to integer type
for col in ["Loan_Requested", "Loan_Funded", "Investor_portion"]:
    dftest[col] = dftest[col].str.slice(start=1).str.replace(",", "").astype("float")

# Data type - To Category
## Number of Payments - only 2: 36 months/ 60 months
# Change to 'Payments_is_36' column with 2 value: (1, 0)
dftest["Payments_is_36"] = (dftest["Number_of_Payments"] == " 36months").map(
    {False: 0, True: 1}
)
dftest = dftest.drop("Number_of_Payments", axis=1)

## Simply change to Category
Category = [
    "Loan_Grade",  # No need to reassign groups, ex.NaN
    "Years_Employed",
    "Home_Ownership",  # No need to reassign groups, ex.'NONE', 'ANY', NaN
    "Verification",
    "Loan_Category",
    "State",
    "Initial_Status",
]
for col in Category:
    dftest[col] = dftest[col].astype("category")
In [64]:
dftest.isna().sum()
Out[64]:
Loan_Requested                        0
Loan_Funded                           0
Investor_portion                      0
Loan_Grade                            0
Years_Employed                     4382
Home_Ownership                        0
Annual_Income                         0
Verification                          0
Issued_MonYY                          0
Loan_Category                         0
State                                 0
Debit_Payment_Ratio                   0
Delinquency_Frequency                 0
Credit_Earliest_MonYY                 0
Inquiries_Count                       0
Delinquency_Recency               38704
Public_Record_Recency             66161
Open_Credit_Line_Count                0
Derogatory_Public_Record_Count        0
Total_Credit                          0
Credit_Utilization_Rate              30
Total_Credit_Line_Count               0
Initial_Status                        0
Payments_is_36                        0
dtype: int64
In [65]:
dftest.head(3)
Out[65]:
Loan_Requested Loan_Funded Investor_portion Loan_Grade Years_Employed Home_Ownership Annual_Income Verification Issued_MonYY Loan_Category State Debit_Payment_Ratio Delinquency_Frequency Credit_Earliest_MonYY Inquiries_Count Delinquency_Recency Public_Record_Recency Open_Credit_Line_Count Derogatory_Public_Record_Count Total_Credit Credit_Utilization_Rate Total_Credit_Line_Count Initial_Status Payments_is_36
0 6000.0 6000.0 6000.0 C 10+ years MORTGAGE 68000.0 VERIFIED - income 15-Mar debt_consolidation OH 28.31 0 2-Nov 1 26.0 NaN 18 0 19861 64.5 33 f 0
1 24000.0 24000.0 24000.0 A 8 years RENT 110480.0 VERIFIED - income 15-Mar debt_consolidation CA 16.03 0 Dec-68 1 NaN NaN 12 0 17001 26.2 36 w 0
2 35000.0 35000.0 35000.0 C 10+ years MORTGAGE 86000.0 VERIFIED - income 15-Mar debt_consolidation PA 32.49 0 Oct-98 0 NaN NaN 16 0 25797 49.9 33 w 0
In [72]:
# dftest["Issued_MonYY"].unique()
dftest = dftest.sort_values("Credit_Earliest_MonYY").reset_index()
# dftest.head()
In [73]:
# 'Credit_Earliest_MonYY' Date format is not consistent, need to be reformatted and changed to datetime object
# Explore: dftest["Credit_Earliest_MonYY"][34000:34100]
# After reviewing, we know from 0~34016 are in "%d-%b" format(this year), and after that are all "%b-%y"
from datetime import datetime


dftest["Credit_Earliest_MonYY"][:34017] = dftest["Credit_Earliest_MonYY"][:34017].apply(
    lambda x: datetime.strptime(str(x + "-22"), "%d-%b-%y")
)

dftest["Credit_Earliest_MonYY"][34017:] = dftest["Credit_Earliest_MonYY"][34017:].apply(
    lambda x: datetime.strptime(str(x), "%b-%y")
)

# 'Issued_MonYY' change to datetime object
dftest["Issued_MonYY"] = dftest["Issued_MonYY"].apply(
    lambda x: datetime.strptime(str(x + "-22"), "%d-%b-%y")
)
/var/folders/08/336_pmn17t56r5jjv51n_8840000gn/T/ipykernel_64897/1547829543.py:7: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/var/folders/08/336_pmn17t56r5jjv51n_8840000gn/T/ipykernel_64897/1547829543.py:11: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [74]:
# 'Issued_MonYY', 'Credit_Earliest_MonYY'
# Change both data into timestamp and subtract from today's timestamp to get how long has passed since the record as numeric record

now_timestamp = datetime.timestamp(datetime.now())


dftest["Issued_MonYY"] = dftest["Issued_MonYY"].apply(lambda x: datetime.timestamp(x))
dftest["Since_Issued"] = now_timestamp - dftest["Issued_MonYY"]

dftest["Credit_Earliest_MonYY"] = dftest["Credit_Earliest_MonYY"].apply(
    lambda x: datetime.timestamp(x)
)
dftest["Since_First_Crdit"] = now_timestamp - dftest["Credit_Earliest_MonYY"]


dftest = dftest.drop(["Issued_MonYY", "Credit_Earliest_MonYY"], axis=1)
In [84]:
dftest = dftest.sort_values("index")
dftest = dftest.drop("index", axis=1)
In [86]:
# Delinquency_Recency (NaN)              38704
dftest["Delinquency_Recency"] = dftest.Delinquency_Recency.fillna(999)
In [87]:
# Public_Record_Recency (NaN)              66161
dftest["Public_Record_Recency"] = dftest.Public_Record_Recency.fillna(999)
In [88]:
# Change dftest["Years_Employed"] from str to float
dftest["Years_Employed"] = dftest["Years_Employed"].apply(
    lambda x: x.replace("< 1", "0").replace("+", "").strip()[:2]
)
dftest["Years_Employed"] = dftest["Years_Employed"].astype("float")
# Fill NaN values using an interpolation method
dftest["Years_Employed"] = dftest["Years_Employed"].interpolate()
In [89]:
dftest["Region"] = dftest["State"].map(states)
dftest["Region"] = dftest["Region"].astype("category")
dftest = dftest.drop("State", axis=1)
In [90]:
# Credit_Utilization_Rate (NaN)             30
dftest["Credit_Utilization_Rate"] = dftest["Credit_Utilization_Rate"].interpolate()
In [91]:
dftest.isna().sum()
Out[91]:
Loan_Requested                    0
Loan_Funded                       0
Investor_portion                  0
Loan_Grade                        0
Years_Employed                    0
Home_Ownership                    0
Annual_Income                     0
Verification                      0
Loan_Category                     0
Debit_Payment_Ratio               0
Delinquency_Frequency             0
Inquiries_Count                   0
Delinquency_Recency               0
Public_Record_Recency             0
Open_Credit_Line_Count            0
Derogatory_Public_Record_Count    0
Total_Credit                      0
Credit_Utilization_Rate           0
Total_Credit_Line_Count           0
Initial_Status                    0
Payments_is_36                    0
Since_Issued                      0
Since_First_Crdit                 0
Region                            0
dtype: int64

04-3. Create dummy variables for categorical variables.

In [92]:
dftest = pd.get_dummies(dftest, drop_first=True)
dftest.head()
Out[92]:
Loan_Requested Loan_Funded Investor_portion Years_Employed Annual_Income Debit_Payment_Ratio Delinquency_Frequency Inquiries_Count Delinquency_Recency Public_Record_Recency Open_Credit_Line_Count Derogatory_Public_Record_Count Total_Credit Credit_Utilization_Rate Total_Credit_Line_Count Payments_is_36 Since_Issued Since_First_Crdit Loan_Grade_B Loan_Grade_C Loan_Grade_D Loan_Grade_E Loan_Grade_F Loan_Grade_G Home_Ownership_OWN Home_Ownership_RENT Verification_VERIFIED - income source Verification_not verified Loan_Category_credit_card Loan_Category_debt_consolidation Loan_Category_home_improvement Loan_Category_house Loan_Category_major_purchase Loan_Category_medical Loan_Category_moving Loan_Category_other Loan_Category_renewable_energy Loan_Category_small_business Loan_Category_vacation Loan_Category_wedding Initial_Status_w Region_Northeast Region_Other Region_South Region_West
11980 6000.0 6000.0 6000.0 10.0 68000.0 28.31 0 1 26.0 999.0 18 0 19861 64.5 33 0 2.897033e+07 8.925535e+06 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
42325 24000.0 24000.0 24000.0 8.0 110480.0 16.03 0 1 999.0 999.0 12 0 17001 26.2 36 0 2.897033e+07 -1.445276e+09 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
74927 35000.0 35000.0 35000.0 10.0 86000.0 32.49 0 0 999.0 999.0 16 0 25797 49.9 33 0 2.897033e+07 7.690727e+08 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
48469 10000.0 10000.0 10000.0 10.0 30000.0 32.96 0 1 999.0 114.0 13 1 9586 43.8 21 0 2.897033e+07 7.584419e+08 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
8832 24000.0 24000.0 24000.0 10.0 82500.0 31.03 0 0 48.0 999.0 27 0 31842 41.3 43 0 2.897033e+07 6.329935e+06 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
In [93]:
# Check if there are 48 features (same as X)
dftest.shape
Out[93]:
(80000, 45)
In [94]:
# X.columns
# dftest.columns
# short of these 3 columns: 'Loan_Grade_O' 'Home_Ownership_OTHER' 'Loan_Category_educational'
In [95]:
missing = ["Loan_Grade_O", "Home_Ownership_OTHER", "Loan_Category_educational"]
for col in missing:
    dftest[col] = pd.DataFrame(["0"] * len(dftest)).astype("uint8")
In [96]:
# FutureWarning:

# The feature names should match those that were passed during fit. Starting version 1.2,
# an error will be raised. Feature names must be in the same order as they were in fit.
# X.columns
cols = [
    "Loan_Requested",
    "Loan_Funded",
    "Investor_portion",
    "Years_Employed",
    "Annual_Income",
    "Debit_Payment_Ratio",
    "Delinquency_Frequency",
    "Inquiries_Count",
    "Delinquency_Recency",
    "Public_Record_Recency",
    "Open_Credit_Line_Count",
    "Derogatory_Public_Record_Count",
    "Total_Credit",
    "Credit_Utilization_Rate",
    "Total_Credit_Line_Count",
    "Payments_is_36",
    "Since_Issued",
    "Since_First_Crdit",
    "Loan_Grade_B",
    "Loan_Grade_C",
    "Loan_Grade_D",
    "Loan_Grade_E",
    "Loan_Grade_F",
    "Loan_Grade_G",
    "Loan_Grade_O",
    "Home_Ownership_OTHER",
    "Home_Ownership_OWN",
    "Home_Ownership_RENT",
    "Verification_VERIFIED - income source",
    "Verification_not verified",
    "Loan_Category_credit_card",
    "Loan_Category_debt_consolidation",
    "Loan_Category_educational",
    "Loan_Category_home_improvement",
    "Loan_Category_house",
    "Loan_Category_major_purchase",
    "Loan_Category_medical",
    "Loan_Category_moving",
    "Loan_Category_other",
    "Loan_Category_renewable_energy",
    "Loan_Category_small_business",
    "Loan_Category_vacation",
    "Loan_Category_wedding",
    "Initial_Status_w",
    "Region_Northeast",
    "Region_Other",
    "Region_South",
    "Region_West",
]

dftest = dftest[cols]
In [97]:
dftest_pred_dt2 = dt2.predict(dftest)
In [99]:
test = pd.read_csv("Holdout for Testing.csv")
predict = pd.DataFrame(
    {"Loan_Id": test["X2"], "Predicted_Interest_Rate": dftest_pred_dt2}
)
In [100]:
# "Titanic Results from" *insert your name or UChicago net ID.
predict.to_csv("Loan_Results_from_Kelsey-Liu.csv", index=False)

Interpretation:¶

Between 2 machine leaning models, Random Forest and Decision Tree, Decision Tree achieves slightly better result with the paramters (max_depth = 10, min_samples_leaf = 0.13) we originally set, as the RMSE for train, test, CV all being 0.2 lower than Random Forest's. Furthermore, when we adjust the parameter, the RMSE are about 0.7 lower than the baseline model. This is because the baseline model is said to underfit the data, which results in CV error(3.65) being very similar to training set error (3.65), but still greater than our desired error. To remedy underfitting, we can either increase max_depth or decease min_samples_leaf, here what I did is decreasing the min_samples_leaf from 0.13 to 0.05. If we want to achieve even better result, as there is not too much room for parameter tuning, we can consider including some of the columns that were excluded in the beginning (ex. do text mining on "Reason for loan"), dedicating more efforts into feature engineering to transform them into useful information, or gathering more relevant features.